1  Abstract


This final report summarizes research carried out under grant AFOSR 88-0205 and provides
an overview of short-term and long-term research goals. The focus of our research
has been primarily in the following four areas:


1.  Built-in  self-test 

2.  Development  of  VLSI-based  multiprocessor  networks 

3.  Design  of  large  fault-tolerant  testable  RAM’s 

4.  Error  control  coding  for  developing  new  fault-tolerant  techniques 

This  report  is  organized  into  the  following  sections.  Section  2  reviews  key  results  developed 
under  the  grant  in  the  above  four  areas.  Section  3  lists  the  publications  supported  by 
AFOSR  88-0205.  Section  4  outlines  our  short-term  and  long-term  goals  for  research  in 
fault-tolerant  and  VLSI-based  systems  and  gives  our  perspective  on  the  future  of  fault- 
tolerant  computing. 


2  Summary  of  Results 

2.1  Built-In Self-Test

Built-in self-test (BIST) has become a standard industry-wide test technique. BIST provides
a mechanism to simplify the process of testing chips to determine which ones survived
the defects introduced in the manufacturing process. BIST also provides opportunities to
periodically test systems meant for high reliability/availability/reconfigurability and to
assist in the identification of field replaceable units (FRU) for high maintainability.

An important issue pertaining to BIST that we have considered is the development of
a general framework for shift register-based test response compressors. In this research
we developed such a framework, together with a mathematical model for it based on
algebraic coding theory. A distinction of the formulation is that it not only allows a
uniform model for analysis of shift register techniques, but also allows for the
development of new techniques. Our research in BIST has evolved in the following
stages:


1.  Coding theory formulation/computation of aliasing probability


2.  Anti-aliasing  techniques 

3.  Extension  to  multiple-output  circuits 

4.  Extension  beyond  symmetric  error  model 


Coding  Theory  Formulation 

A generic BIST structure, a multiple-input shift register (MISR), is depicted in Figure 1.
The X marks represent logical AND operations with the values φ_0 through φ_{m-1}. The set
of φ values reflects the feedback polynomial used. The + marks represent logical XOR
operations. The vector t_0 through t_{m-1} represents the m outputs of the circuit for each
test. For a single-output circuit, only t_0 exists and the BIST structure constitutes a linear
feedback shift register (LFSR). An LFSR implements division of the circuit output sequence
by the feedback polynomial. The remainder of the division remains in the shift registers
D_0 through D_{m-1}. After applying the test sequence, the remainder is termed the signature
of the circuit and is available for comparison against the signature of a fault-free circuit.
The quotient of the division is represented by the bits shifted out from D_{m-1} after each
test. Together, the remainder and quotient would completely represent the input sequence
to an LFSR, but since the quotient is lost, a fault may yield a functionally different circuit
with a different quotient but the same signature as a good circuit. Such instances are
referred to as aliasing, and a major problem of BIST is proving the extent of aliasing for
particular circuits and BIST structures.
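
The division performed by a single-output LFSR can be illustrated with a minimal Python sketch. The function name, the feedback polynomial x^3 + x + 1, and the two output sequences are hypothetical choices made only for illustration; they are not taken from the report. The faulty sequence differs from the good one by a multiple of the feedback polynomial, so the two signatures coincide while the quotients differ, which is the aliasing situation described above.

```python
def lfsr_divide(bits, poly):
    """Divide a bit stream by `poly` over GF(2), one bit per clock.

    bits: circuit output sequence, highest-degree coefficient first.
    poly: feedback polynomial as an int (0b1011 means x^3 + x + 1).
    Returns (signature, quotient_bits).
    """
    k = poly.bit_length() - 1      # number of register stages
    reg = 0                        # register contents, initially cleared
    quotient = []
    for b in bits:
        reg = (reg << 1) | b       # shift the next output bit in
        out = (reg >> k) & 1       # bit shifted out of the last stage
        quotient.append(out)
        if out:
            reg ^= poly            # feedback XOR subtracts the polynomial
    return reg, quotient


POLY = 0b1011                                  # hypothetical: x^3 + x + 1
good = [1, 0, 1, 1, 0, 0, 1]                   # hypothetical fault-free outputs
error = [0, 0, 1, 0, 1, 1, 0]                  # x * POLY, a multiple of POLY
faulty = [g ^ e for g, e in zip(good, error)]  # outputs of a faulty circuit

sig_good, quo_good = lfsr_divide(good, POLY)
sig_faulty, quo_faulty = lfsr_divide(faulty, POLY)
print("same signature:", sig_good == sig_faulty)
print("same quotient: ", quo_good == quo_faulty)
```

The first k quotient bits are always zero for a k-stage register, so the useful quotient is shorter than the test sequence by the signature length.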



Figure  1:  Conventional  MISR  Compressor 


Conventional methods for determining the aliasing probability of a BIST structure use
Markov models. Such models have the advantage of tractability for simple BIST structures.
BIST structures, however, can be described in a natural way in terms of algebraic
coding theory. We have developed such a formulation, which has the following advantages
over Markov models:

1. Markov models predict the asymptotic aliasing probability as the length of the test
sequence goes to infinity, whereas the coding theory formulation allows exact computation
of the aliasing probability for any test sequence.

2. The coding theory formulation can be extended to consider MISR's and more complicated
BIST structures, while extension of Markov models to more complicated
structures causes them to lose their tractability.

3. Choices of the feedback polynomial and of the test pattern generator will yield
differing aliasing probabilities, which the coding theory formulation will discern.
Markov models can be modified to account for different feedback polynomials, but
doing so causes them to lose their tractability.
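
The difference between an exact finite-length value and an asymptotic one can be seen in a small brute-force sketch. It assumes a single-output LFSR and a symmetric error model in which every nonzero error sequence of length n is equally likely; aliasing then occurs exactly when the error polynomial is a multiple of the feedback polynomial. The function names and the polynomial x^3 + x + 1 are hypothetical, and exhaustive enumeration is used here only as an illustration, not as the coding theory computation developed under the grant.

```python
from fractions import Fraction


def gf2_mod(value, poly):
    """Remainder of `value` divided by `poly`, both read as GF(2) polynomials."""
    k = poly.bit_length() - 1
    while value.bit_length() > k:
        value ^= poly << (value.bit_length() - 1 - k)
    return value


def aliasing_probability(n, poly):
    """Exact aliasing probability for test length n (symmetric error model)."""
    aliased = sum(1 for e in range(1, 2 ** n) if gf2_mod(e, poly) == 0)
    return Fraction(aliased, 2 ** n - 1)


POLY = 0b1011  # hypothetical feedback polynomial x^3 + x + 1
for n in (3, 4, 6, 10):
    print(n, aliasing_probability(n, POLY))
```

Under these assumptions the count of aliasing errors is 2^(n-k) - 1 for a degree-k polynomial, so the exact value (2^(n-k) - 1)/(2^n - 1) approaches 2^(-k) only as n grows.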

Anti-Aliasing  Techniques 

Aliasing occurs because the quotient of the division is lost. The quotient is discarded
because it is very nearly the same length as the circuit output sequence (quotient length
equals test sequence length minus signature length). The question naturally arises whether
it is possible to apply the quotient to an LFSR and produce a signature for it, thereby
reducing aliasing potentially to zero. We have solved this problem. Given any circuit, any
test sequence, and any LFSR, we can obtain a second LFSR that combined with the first
produces a unique combination of signatures. The maximum number of shift registers
needed for both divisions is about half the test sequence length. There are two difficulties
with this method:

1.  Determining  the  feedback  polynomial  of  the  second  LFSR  is  not  computationally 
tractable,  so  it  is  generally  possible  only  for  small  circuits  and  test  sequences. 

2.  The  maximum  number  of  shift  registers  needed  is  still  unacceptably  large  and  we 
have  proven  that  most  circuits  require  near  the  maximum  number  of  shift  registers. 

Despite  these  difficulties,  researchers  at  United  Technologies  Corporation  have  reported 
that  they  have  achieved  zero  aliasing  for  a  chip  that  was  particularly  amenable  to  the 
technique. 

Multiple-Output  Circuits 

Unlike conventional aliasing models, the coding theory formulation allows us to compute
the exact aliasing probability for a wide variety of BIST structures. For circuits with many
outputs, the cost of implementing BIST for each output is prohibitive. Accordingly, BIST
structures such as the MISR in Figure 1 are in common use since they incorporate entire
output vectors into a single shift register.

Other testing paradigms that do not involve single outputs have been proposed. An
example is the STUMPS paradigm as depicted in Figure 2. The STUMPS paradigm
provides facilities to test a number of chips simultaneously. These chips may be expected to
provide identical outputs (SO_1 through SO_m all the same) as might be the case for testing
after manufacture. Alternatively, the chips may produce different outputs, as would be the
case for board-level testing. Our general framework allows exact computation of aliasing
probabilities in such settings and provides a research framework to determine good BIST
structures and to determine good feedback polynomials within particular BIST structures.


Figure  2:  Global  Test  Using  STUMPS 


Various  Error  Models 

Work described in previous sections has assumed a symmetric error model (all outputs
equally likely given there is a fault). But other error models may be more appropriate than
the (2^m-ary) symmetric error model. For example, the independent error model (BSC)
assumes that when an error exists each output is affected independently and with a given
probability. We have developed a very general error model that subsumes these and other
error models. The effects of different error models have been considered and our method
has been applied to particular circuits. Figure 3 shows the aliasing probability computed
under different error models for various test lengths. The subject circuit is C432 of the
MCNC combinational benchmark test circuits. Note that aliasing probability tends to
a particular value as the test sequence length increases, as predicted by Markov model
methods. The aliasing probability may differ significantly, however, under different error
models when the test sequence is shorter.
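
The sensitivity to the error model at short test lengths can be illustrated with a toy calculation for a single-output LFSR. The sketch below is only an assumption-laden illustration: it treats each output bit as flipped independently with probability p (the independent model), defines the aliasing probability as conditional on at least one bit being in error, and uses a hypothetical polynomial x^3 + x + 1. None of these specifics come from the report, and the code does not reproduce the C432 results of Figure 3.

```python
def gf2_mod(value, poly):
    """Remainder of `value` divided by `poly`, both read as GF(2) polynomials."""
    k = poly.bit_length() - 1
    while value.bit_length() > k:
        value ^= poly << (value.bit_length() - 1 - k)
    return value


def aliasing_independent(n, poly, p):
    """Aliasing probability for test length n when each output bit is in
    error independently with probability p, given at least one error."""
    alias = 0.0
    for e in range(1, 2 ** n):
        if gf2_mod(e, poly) == 0:          # error is masked by the signature
            w = bin(e).count("1")          # number of erroneous bits
            alias += p ** w * (1 - p) ** (n - w)
    return alias / (1 - (1 - p) ** n)


POLY = 0b1011  # hypothetical feedback polynomial x^3 + x + 1
for n in (4, 6, 8, 12):
    symmetric = (2 ** (n - 3) - 1) / (2 ** n - 1)
    print(n, round(symmetric, 5), round(aliasing_independent(n, POLY, 0.05), 5))
```

With p = 0.5 every error sequence is equally likely and the independent model coincides with the symmetric one; for small p the two values differ noticeably at these short lengths.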


Figure 3: Aliasing Probability for C432


Our general error model has the following disadvantages versus the symmetric error model:


1.  The  coding  theory  formulation  loses  much  of  its  computational  tractability.  This 
loss,  however,  is  due  entirely  to  the  increased  data  that  the  general  error  model 
must  consider.  Under  the  most  general  model  each  fault  must  be  considered  for 
each  test  and  the  probability  of  a  given  output  determined. 

2. Under the symmetric error model, BIST methods could be analyzed and a good
BIST method found very early in the development of the product. The symmetric
error model only requires the circuit's functional specification. With a general error
model, a more detailed circuit description is necessary (at least as low as the gate-level
specification), because how the circuit's function is implemented dictates the
effects of faults on the outputs and also the set of possible faults.


2.2  VLSI-Based Multiprocessor Networks


We have studied a number of topologies suitable for VLSI implementation. The primary
criteria for evaluating VLSI topologies are (1) support for mapping common algorithms
to the architecture for use as a parallel processing machine, (2) short distances between
nodes (as reflected by the graph diameter), (3) ability to sustain diameter in the presence
of faults, and (4) amenability to two-dimensional layout for VLSI.


De  Bruijn  Network 

Under AFOSR support, we were first to discover the value of de Bruijn graphs for VLSI-based
multiprocessor networks. A recent development that emphasizes the significance of
de Bruijn networks is their use in the projected Galileo spacecraft for coding problems.

We have studied binary de Bruijn graphs (BDG) extensively. We derived a lower bound
on the VLSI layout area of the BDG and obtained a layout method to meet the bound.
We have shown that BDG's can be configured to match the data flow graph of a large
class of algorithms. A careful comparison of the BDG with the hypercube reveals that BDG's
admit various important configurations such as complete binary trees and one-step
shuffle-exchange networks (which are not admissible by hypercubes). Consequently, the BDG can
support a wide variety of algorithms in addition to many algorithms supported by the
binary cube. We have shown that the BDG is the only known network that can sort in
all known categories of sorting. We have also shown that the BDG is a powerful
architecture for a wide variety of graph and linear algebra applications, and
that certain string comparison algorithms can run efficiently on the BDG.

Shuffle-exchange networks are useful for a variety of problems such as permutation and
the fast Fourier transform. Trees are useful for problems of a divide-and-conquer nature
such as sorting and parallel prefix operations. Many algorithmic paradigms exist that may
be described as graphs. We have shown that BDG's support the most common paradigms
and therefore form a quite useful basis for parallelizing algorithms.
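
For readers unfamiliar with the topology, the following minimal sketch builds the successor relation of a binary de Bruijn graph on 2^n nodes and routes by shifting in the destination address one bit at a time. It uses the standard textbook definition of the graph (node x links to 2x mod 2^n and 2x + 1 mod 2^n); the function names are hypothetical, and the simple shift route shown is not necessarily a shortest path, nor is it the fault-tolerant routing studied under the grant.

```python
def successors(x, n):
    """The two nodes reached from x by shifting in a 0 or a 1."""
    mask = (1 << n) - 1
    return [(x << 1) & mask, ((x << 1) | 1) & mask]


def shift_route(src, dst, n):
    """Route from src to dst in n steps by shifting in dst's bits, MSB first."""
    mask = (1 << n) - 1
    path = [src]
    cur = src
    for i in reversed(range(n)):
        cur = ((cur << 1) | ((dst >> i) & 1)) & mask
        path.append(cur)
    return path


n = 4                                   # 16-node binary de Bruijn graph
path = shift_route(0b1011, 0b0110, n)
print([format(v, "04b") for v in path])
```

Because every route needs at most n shifts, the distance between any two nodes is logarithmic in the number of nodes while each node keeps a fixed number of links.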


Flip-Trees

Continued advances in VLSI technology hold the promise of very large distributed systems
where each node in the system is fabricated on a single chip. Thousands or even millions
of processors could be joined in such a system. Massive computational resources, however,
imply that the communication effectiveness of the system may be the weakest link in
the design. This holds both for the performance of the system (some parallel and,
especially, distributed algorithms require communication that grows faster than the number
of processors) and for reliability (having many processors may allow task data replication
without impacting performance, while increasing the reliance on the communication
structure).

A  container  is  a  set  of  node-disjoint  paths  between  a  pair  of  nodes.  The  advantages  of 
containers  are  briefly  discussed  below: 


1. By sending a message along more than one of the node-disjoint paths, the message
will arrive correctly at its destination if a majority of the paths has all nodes nonfaulty
(or, when all faults are site crashes, if any one path has all nodes nonfaulty).

2.  A  message  can  be  sent  along  one  path.  The  recipient  can  acknowledge  receipt  of 
the  message  by  sending  the  acknowledgement  along  paths  in  the  container.  This  is 
a  distributed  handshake.  The  acknowledgement  is  a  brief  message,  so  the  expense 
of  sending  it  along  multiple  paths  can  be  justified.  If  the  original  message  cannot 
be  altered  by  a  faulty  node  along  its  path,  then  the  distributed  handshake  problem 
reduces  to  resolving  whether  the  message  was  received  (whether  the  path  was  intact). 
We  have  developed  a  very  general  solution  to  this  problem. 

3.  Duplex  communication  between  a  pair  of  nodes  can  be  achieved,  without  congestion, 
by  assigning  two  different  paths  to  the  two  different  directions  of  communication. 

4.  Containers  admit  a  simple  fault-tolerant  distributed  routing  strategy  using  table 
look-up.  Each  node  can  maintain,  for  every  pair  of  nodes,  the  names  of  its  two 
neighbors  along  the  exclusive  node-disjoint  path  it  is  along  between  those  two  nodes 
(if  it  is  along  one  of  the  node-disjoint  paths).  This  has  the  distinct  advantage  of 
reducing  routing  information  in  messages. 
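
The first advantage, delivery by majority over node-disjoint paths, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name, the node names, and the fault behavior (a faulty interior node appends a marker to the message copy it forwards) are all hypothetical, and the sketch does not model the distributed handshake or the table look-up routing.

```python
from collections import Counter


def deliver(message, paths, faulty, corrupt=lambda m: m + "?"):
    """Send one copy per path and accept the value seen on a majority of paths.

    paths:  node-disjoint paths, each a list [source, ..., destination].
    faulty: set of faulty nodes; a faulty interior node corrupts the copy.
    Returns the accepted value, or None when no value has a majority.
    """
    copies = []
    for path in paths:
        copy = message
        for node in path[1:-1]:            # interior nodes only
            if node in faulty:
                copy = corrupt(copy)
        copies.append(copy)
    value, votes = Counter(copies).most_common(1)[0]
    return value if votes > len(paths) // 2 else None


# Hypothetical container of width three between nodes "s" and "t".
container = [["s", "a", "t"], ["s", "b", "t"], ["s", "c", "t"]]
print(deliver("hello", container, faulty={"a"}))
print(deliver("hello", container, faulty={"a", "b"}))
```

With one faulty path the two intact copies outvote the corrupted one; with two faulty paths the majority condition fails and the accepted value can no longer be trusted.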


We describe a family of graphs called flip-trees that has two parameters: the degree, d,
and the diameter, 2ℓ - 1. When ℓ = 1, the graph degenerates into the complete graph on
d + 1 nodes. d is constrained to be an integer at least equal to three. Let us consider a
tree with the root having an extra son. Figure 4 shows such a basic tree of depth two to
be used in constructing a flip-tree of degree three.

The figure reflects the labelings we use for the nodes of the graph. The root is labeled *.
The sons of the root are labeled from 0* to (d - 1)*. The sons of a node a b_1 b_2 ... b_i * are
a b_1 b_2 ... b_i 1* through a b_1 b_2 ... b_i (d - 1)*. The leaves of the tree are labeled in the form
a b_1 b_2 ... b_{ℓ-1}, where a is an integer from 0 to d - 1 and the b_i are integers from 1 to d - 1.
So the leaves are distinguished from the other nodes in the tree by labels not ending in *.

It remains to decide how to interconnect the leaves of the basic tree. Flip-trees are
constructed from basic trees by connecting each leaf a b_1 b_2 ... b_{ℓ-1} to all leaves a' b_{ℓ-1} ... b_2 b_1,
where a' is any integer from 0 to d - 1 other than a. Figure 5 shows the flip-tree for d = 3
and ℓ = 3.
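
The construction can be made concrete with a short sketch. It rests on one reading of the labeling, which is an assumption where the source text is hard to make out: the root has d sons, every other internal node has d - 1 sons, leaves sit at depth ℓ, and each leaf with first symbol a and remaining symbols b is joined to every leaf whose first symbol differs from a and whose remaining symbols are b reversed. Nodes are represented as tuples of symbols (the root is the empty tuple), and the function names are hypothetical.

```python
from collections import deque


def flip_tree(d, ell):
    """Adjacency sets of a flip-tree of degree d built on a tree of depth ell."""
    adj = {}

    def link(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    level = [()]                                  # the root
    for depth in range(1, ell + 1):
        symbols = range(d) if depth == 1 else range(1, d)
        nxt = []
        for parent in level:
            for s in symbols:
                child = parent + (s,)
                link(parent, child)               # tree edge
                nxt.append(child)
        level = nxt
    for leaf in level:                            # flip edges between leaves
        a, b = leaf[0], leaf[1:]
        for a2 in range(d):
            if a2 != a:
                link(leaf, (a2,) + b[::-1])
    return adj


def diameter(adj):
    """Largest shortest-path distance, by breadth-first search from each node."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


g = flip_tree(3, 3)
print(len(g), {len(nbrs) for nbrs in g.values()}, diameter(g))
```

Under this reading every node has exactly d neighbors, and the case ℓ = 1 reduces to the complete graph on d + 1 nodes, matching the description above.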

Figure 4: Basic Tree and Node Labelings


Figure 5: Example Flip-Tree


Our main result has been to show that flip-trees with parameters d and ℓ have a container
of width d and length < 2ℓ + 1 between every pair of nodes. Flip-trees have the best
known containers.

We  have  shown  that  flip-trees  are  competitive  with  respect  to  many  aspects  of  network 
topologies,  such  as  diameter  and  fault-tolerant  diameter,  as  well  as  having  the  best  known 
containers.  The  primary  areas  of  deficiency  are:  (1)  traffic  congestion  and  (2)  distributed 
routing  with  localized  routing  information. 


(1) Roughly (d - 2)/(d - 1) of all messages must be routed through some node on level
⌈ℓ/2⌉, but roughly (d - ...) of all nodes are at level ⌈ℓ/2⌉. As the diameters
of networks of interest increase, this imbalance is exacerbated. This is the same
level of congestion that butterfly networks experience when conducting all-to-all
communication.

(2) We have shown that topologies such as the de Bruijn graph and hypercube are
amenable to a highly distributed routing approach, where each node need maintain
only the faulty/nonfaulty status information of nearby nodes, detouring messages
around faulty nodes. This approach is not practical for flip-trees, because many
pairs of nodes at a distance of two from each other (for instance, nodes near the
root) do not have a short detour if their common neighbor is faulty.


Hyper-de  Bruijn  Network 

Hypercube  and  de  Bruijn  networks  each  possess  certain  desirable  properties.  But  some 
of  the  attractive  features  of  one  network  are  not  found  in  the  other.  We  have  developed 
an  architecture,  the  hyper-de  Bruijn  (HDB)  network,  which  is  a  Cartesian  product  of 
the  hypercube  and  the  de  Bruijn  network.  Figure  6  depicts  a  16-node  binary  de  Bruijn 
network  and  Figure  7  depicts  a  16-node  HDB  network  obtained  as  a  product  of  a  4-node 
hypercube  and  a  4-node  de  Bruijn  network. 

Like the hypercube and de Bruijn networks, HDB networks have logarithmic diameter.
But while the de Bruijn network has a fixed degree (number of ports per node) of four
and the hypercube has degree that grows with the number of nodes, the HDB allows
the designer to select any node degree between four and the logarithm of the number of
nodes. The fixed node degree of the de Bruijn network can be seen as a drawback when
one considers the probability of a path existing between any two nodes in the presence of
faults. As the number of nodes in a network increases while the reliability of each node
remains constant, the degree necessary to maintain a prescribed level of path reliability
would increase. For the hypercube, its logarithmic increase in degree exceeds what is
necessary to maintain path reliability.

Because the HDB is a Cartesian product, message routing on the HDB is no more
complex than for the cube and the de Bruijn network (i.e., trivial). Further,
being a Cartesian product, the HDB is quite resilient to faults. We have established
facile routing algorithms for the HDB that route in the presence of faults. Further, these
routing algorithms are distributed in nature: each node does not need to be aware of
the good/faulty status of all nodes in the network; each node need only be aware of the
status of its immediate neighbors.
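
Why routing on a Cartesian product stays simple can be sketched as follows. A node is a pair (h, x) with h a hypercube address of m bits and x a de Bruijn address of n bits; a route first corrects the hypercube coordinate bit by bit and then shifts the de Bruijn coordinate toward its target. The function names are hypothetical, the de Bruijn factor uses the standard shift definition, and the sketch covers only fault-free routing, not the fault-tolerant algorithms mentioned above.

```python
def hdb_neighbors(node, m, n):
    """Neighbors of (h, x) in the product of an m-cube and a 2^n-node de Bruijn graph."""
    h, x = node
    mask = (1 << n) - 1
    cube = {(h ^ (1 << i), x) for i in range(m)}
    succ = {(h, (x << 1) & mask), (h, ((x << 1) | 1) & mask)}
    pred = {(h, x >> 1), (h, (x >> 1) | (1 << (n - 1)))}
    return (cube | succ | pred) - {node}          # drop de Bruijn self-loops


def hdb_route(src, dst, m, n):
    """Fault-free route: fix hypercube bits first, then shift in de Bruijn bits."""
    (h, x), (hd, xd) = src, dst
    mask = (1 << n) - 1
    path = [(h, x)]
    for i in range(m):
        if ((h ^ hd) >> i) & 1:
            h ^= 1 << i
            path.append((h, x))
    for i in reversed(range(n)):
        x = ((x << 1) | ((xd >> i) & 1)) & mask
        if (h, x) != path[-1]:                    # skip self-loop steps
            path.append((h, x))
    return path


# 16-node example: 4-node hypercube (m = 2) times 4-node de Bruijn graph (n = 2).
path = hdb_route((0b00, 0b01), (0b11, 0b10), 2, 2)
print(path)
```

The route length is at most m + n, the sum of the two factor diameters, which is why the product keeps logarithmic diameter.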

The HDB network contains various computationally important networks as subgraphs:
rings, multidimensional meshes, complete binary trees, meshes of trees, and others. The
multidimensional meshes are important in a variety of algorithms such as the solution of
partial differential equations. The meshes of trees are important to algorithms such as
matrix multiplication.


2.3  Large Fault-Tolerant Testable RAM's

Description  of  TRAM 

We have proposed a new design to implement large, fault-tolerant, testable RAM's in VLSI.
This novel design has been patented by the USAF. The design (TRAM) implements the
divide-and-conquer concept. A multimegabit RAM is implemented by dividing the RAM
into a number of modules which are laid out in VLSI as the leaves of a tree. Figure 8
depicts an H-tree layout. The H-tree is a two-dimensional tree layout whose area is about
twice the number of nodes and four times the number of leaf nodes
(under the Thompson grid layout model). An actual implementation of TRAM would not
consume quadruple area, because only the leaf nodes are large and the width of buses
connecting the nodes is likewise not large. TRAM has the following features:


Figure  8:  Example  H-Tree 

1. Testability. A major problem of testing RAM's is that the number of tests required,
even under simple fault models, grows faster than the number of bits of memory
implemented. By dividing the memory into a number of modules, the complexity
of the testing problem is substantially reduced. In addition, each module may be
supplied with an on-chip test mechanism, thereby allowing the nodes to be tested
in parallel, reducing test time further. TRAM is the first architecture that yields
practical testing times for multimegabit RAM's.

2. Performance. For large RAM's, the TRAM architecture has the potential for reducing
the access time by about 30 percent. The access time of a TRAM is dictated
by  the  delay  in  using  the  tree  to  access  the  correct  leaf  node  (logarithmic  in  the 
number  of  leaves)  plus  the  delay  to  access  the  proper  bit  from  the  leaf  node  (grows 
as  the  square  root  of  the  number  of  bits  in  the  leaf  module)  plus  overhead  delays.  A 
conventional  RAM  does  not  experience  a  delay  to  traverse  the  tree,  but  the  (square 
root)  delay  to  access  the  proper  bit  is  much  larger,  because  TRAM  has  divided  the 
problem  into  much  smaller  modules. 

3.  Area  Overhead.  The  additional  area  overhead  for  the  TRAM  architecture  is  typically 
from  8  to  20  percent  for  a  large  RAM.  The  variation  in  overhead  is  due  to  the 
fundamental  choice  in  the  design  of  the  TRAM  —  how  many  leaf  modules  to  use. 
Choosing  how  large  to  grow  the  tree  affects  the  area  overhead,  the  access  time,  and 
the  testing  time  for  the  TRAM. 

4.  Partitionability.  The  regular  structure  of  the  TRAM  and  its  ability  to  test  leaf 
modules  independently  allow  the  manufacturer  to  determine  when  a  partially  good 
product  (e.g.,  half-size  RAM)  can  be  obtained.  This  improves  the  economic  viability 
of  manufacturing  very  large  RAM’s.  The  situation  is  similar  to  the  manufacture  of 
the  Intel  80486DX  and  80486SX  processors.  In  manufacturing  an  80486DX,  if  testing 
shows  that  the  chip  is  functional  except  for  the  floating  point  unit,  then  the  chip  can 
still  be  shipped  as  an  80486SX. 


Extended  Yield  Analysis  of  TRAM 

The TRAM design has maximal benefits for very large RAM's. Sixteen megabit and larger
RAM's are now in production and development. These memories will require large-area
VLSI or even WSI to produce. Conventional IC fabrication yield models are not valid
for large-area VLSI and beyond. We developed the center-satellite yield model to accommodate
the necessities of ambitious designs. The center-satellite model provides different
yield projections than conventional models for large-area VLSI designs incorporating
redundancy. In addition to a fundamental rethinking of the defect process in IC fabrication,
our yield model also directly incorporates well-known anomalies that become significant
for WSI designs, such as the radial dependence of defect densities.

We have modelled the TRAM design for very large memories (e.g., 16 megabits to 1
gigabit). TRAM allows for testing of individual modules and reconfiguration to still yield
a shippable product. Therefore it is not necessary to achieve near perfect yield of each
module. The existence of hardcore in each module does not permit near perfect yield
anyway. We have analyzed in depth the yield of individual modules with the following
four redundancy schemes:


1.  Extra  columns  only.  This  is  the  weakest  scheme.  Good  yield  requires  each  module 
to  be  substantially  less  than  one  megabit. 

2.  Extra  rows  and  columns.  This  is  marginally  better  than  extra  columns  only.  Coding 
is  required  for  larger  module  sizes. 

3. Coding. This scheme has more hardcore, but the cost is worthwhile for larger module
sizes. Coding alone may be sufficient, depending on the fabrication line-dependent
parameters of the center-satellite model.

4. Coding with extra rows. This is the best scheme. With current fabrication line
quality, this scheme can produce acceptable module yield for virtually any feasible
module size. This scheme may be defeated if further decreases in feature sizes lead
to much higher defect densities.

With acceptable module yields (e.g., better than 80 percent), it is possible to use the block
substitution capabilities of TRAM. Our extended yield analysis has established the level
of redundancy required at each module to optimize product yield. For example, modules
near the center of the wafer may find coding alone to be most efficacious, while modules
near the wafer periphery (and most susceptible to radial variation) may find coding with
extra rows necessary. Current redundancy schemes for RAM's allow fine gradations in
redundancy levels; for example, extra rows may be added one at a time.


2.4  Error  Control  Coding 

We are beginning work on the analysis of voting systems that employ coding. The principal
emphasis of the work is to determine the reliability and safety issues involved and to
characterize the nature of the tradeoff between reliability and safety. We describe the unit
that determines the output of the system as the arbitrator. The arbitrator's purpose is to
determine the most likely correct output and also to raise a safety flag when the doubt on
the correctness of the output exceeds a selectable threshold.

Preliminary results have been obtained for n-modular redundant (nMR) systems. These
results define and prove the optimal arbitration policies. We have shown that certain
optimal arbitration policies for nMR cannot be exceeded (in terms of reliability and safety)
by any arbitration policy in an (n + 1)MR system. This result holds when the n outputs
to be arbitrated do not themselves contain redundancy. Similarly, an (n + 2)MR system
always has arbitration policies strictly better than any nondegenerate arbitration policy
for an nMR system. When any redundancy is incorporated into the n module outputs,
(n + 1)MR then is guaranteed to exceed nMR.
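
The role of the arbitrator and its selectable threshold can be illustrated with a deliberately simple sketch. It is not one of the optimal arbitration policies referred to above: the function name and the rule used (return the most frequent module output and raise the safety flag when fewer than `min_agree` modules agree on it) are assumptions chosen only to show how a threshold trades reliability against safety.

```python
from collections import Counter


def arbitrate(outputs, min_agree):
    """Return (value, safety_flag) for the outputs of an nMR system.

    value:       the output reported by the largest number of modules.
    safety_flag: True when fewer than `min_agree` modules agree on it.
    """
    value, votes = Counter(outputs).most_common(1)[0]
    return value, votes < min_agree


# Hypothetical 3MR readings.
print(arbitrate([5, 5, 7], min_agree=2))   # two modules agree
print(arbitrate([5, 6, 7], min_agree=2))   # no agreement, flag raised
print(arbitrate([5, 5, 7], min_agree=3))   # stricter threshold, flag raised
```

Raising `min_agree` makes the system flag more outputs as doubtful (safer) at the cost of accepting fewer outputs (less reliable in the sense of delivering a usable result).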


3  Patent  and  Publications  Under  AFOSR  88-0205 


Patent: 

“Easily Testable High Speed Architecture for Large RAMs,” U.S. Patent Number
4,833,677. Date: May 23, 1989. Inventors: Najmi T. Jarwala and Dhiraj K. Pradhan.
Assignee: U.S. Government represented by the Secretary of the Air Force, Washington,
DC.

Publications: 

“The hyper-deBruijn multiprocessor networks,” IEEE Trans. Parallel and Distr. Sys.,
(with E. Ganesan), submitted.

“Yield Optimization of Redundant Multimegabit RAM's Using the Center-Satellite
Model,” Int. Conf. on Wafer Scale Integration, (with D. Das Sharma and F. Meyer),
submitted.

“A theorem on the fault-tolerance of a modified de Bruijn topology,” J. Discrete Math.,
(with S. Toida and F. Meyer), to appear.

“A  Uniform  Analysis  of  Aliasing  in  MISR  compression  for  various  error  models,”  Int.  Test 
Conf.,  (with  M.  Karpovsky  and  S.  Gupta),  October  1991. 

“A  framework  for  designing  and  analyzing  new  BIST  techniques  and  zero  aliasing  com¬ 
pression,”  IEEE  Trans.  Comput.,  vol.  40,  no.  6,  (with  S.  Gupta),  June  1991. 

“System-Level Diagnosis: Combining Detection and Location,” Fault Tol. Comput. Symp.,
Montreal, pp. 488-495, (with N. Vaidya), June 1991.

“Program Fault Tolerance Based on Memory Access Behavior,” Fault Tol. Comput. Symp.,
Montreal, pp. 426-433, (with N. Bowen), June 1991.

“The Hyper-deBruijn Multiprocessor Networks,” Int. Conf. Distr. Comput. Sys., Arling-

“Consensus with dual failure modes,” IEEE Trans. Parallel and Distr. Sys., vol. 2, no. 2,
pp. 214-222, (with F. Meyer), April 1991.

Application Specific Array Processors, Princeton, NJ, September 1990.

“Zero Aliasing Compression,” Fault Tol. Comput. Symp., Newcastle upon Tyne, UK, pp.
254-263, (with S. Gupta and S. Reddy), June 1990.

“Aliasing probability for a multiple input signature analyzer and a new compression
technique,” IEEE Trans. Comput., vol. 39, no. 4, pp. 586-591, (with S. Gupta and
M. Karpovsky), April 1990.

“Organization  and  analysis  of  gracefully-degrading  inter-leaved  memory  systems,”  IEEE 
Trans.  Comput.,  (with  K.  Saluja,  G.  Sohi,  and  K.  Cheung),  1989. 

“On  Implementing  Improved  Access  Control  Protocol  for  Shared  Data  Systems,”  IEEE 
Symp.  on  Parallel  and  Distr.  Comput.,  Dallas,  TX,  pp.  389-396,  (with  A.  Mendelson  and 
A.  Singh),  May  1989. 

“The de Bruijn multiprocessor networks: A versatile parallel processing network for VLSI,”
IEEE Trans. Comput., vol. 38, no. 4, pp. 567-581, (with M. Samatham), April 1989.

“Modeling defect spatial distribution,” IEEE Trans. Comput., vol. 38, no. 4, pp. 538-546,
(with F. Meyer), April 1989.

“Dynamic testing strategy for distributed systems,” IEEE Trans. Comput., vol. 38, no. 3,
pp. 356-365, (with F. Meyer), March 1989.

“Yield  Modeling  and  Optimization  of  Large  Redundant  RAMs,”  Int.  Conf.  on  Wafer  Scale 
Integration,  San  Francisco,  CA,  pp.  273-287,  (with  A.  Singh  and  K.  Ganapathy),  January 
1989. 

“TRAM:  A  design  methodology  for  high  performance  testable  large  RAMs,”  IEEE  Trans. 
Comput.,  vol.  37,  no.  10,  pp.  1235-1250,  (with  N.  Jarwala),  October  1988. 

“RTRAM: Reconfigurable and Testable Multi-Megabit RAM Design,” Int. Test Conf.,

“Designing interconnection buses in VLSI and WSI for maximum yield and minimum


4 Short-Term and Long-Term Research Goals

4.1  Area-Specific  Research  Goals 


This  section  describes  opportunities  that  remain  for  further  research  in  the  four  areas 
that  have  been  the  subject  of  AFOSR  88-0205.  The  next  section  describes  our  general 
perspective  on  the  future  of  fault-tolerant  computing  and  research  opportunities. 


BIST 

The most important area for progress is in applying our methods to sequential circuits. All
results so far have assumed combinational circuits. The testing problem itself is extremely
difficult for sequential circuits; important methodologies currently in practice, such as
boundary scan, have significantly eased the testing problem, although not to acceptable
levels. All models to determine aliasing in BIST for sequential circuits are intractable. The
coding theory formulation, however, holds some promise, although exploiting it may
require entirely new BIST structures.

Circuits  with  multiple  outputs  generally  have  the  effect  of  their  outputs  distributed  across 
the  BIST  structure  as  in  Figure  1.  Multiple  outputs,  however,  may  also  be  compressed 
first  with  a  combinational  circuit  to  produce  a  single  output  to  feed  the  BIST  structure. 
Output compression has been avoided, though, because the effect on aliasing was hard to
predict.  With  our  general  error  model,  however,  we  can  accurately  calculate  the  aliasing 
probability.  Therefore,  output  compression  should  be  reconsidered. 

The coding theory formulation provides a framework to evaluate BIST structures, but
much work remains to be done to apply our results. Procedures should be developed for
popular BIST structures to determine good parameters (e.g., feedback polynomial) for
them. There are also many opportunities for novel BIST structures, and for the first
time we have the tools to evaluate them properly.
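
As a minimal sketch of what evaluating a feedback polynomial involves, the following models an idealized single-input signature register as polynomial division over GF(2). It relies only on the classical condition that a faulty response is masked exactly when the error polynomial is a multiple of the feedback polynomial. The function names, the integer encoding of bit streams, and the example polynomial x^4 + x + 1 are assumptions made for illustration; this is not the coding theory framework developed under the grant.

```python
def poly_mod(value: int, poly: int) -> int:
    """Remainder of value(x) divided by poly(x) over GF(2).

    Polynomials are encoded as integers: bit i is the coefficient of x^i.
    """
    if poly == 0:
        raise ValueError("feedback polynomial must be nonzero")
    degree = poly.bit_length() - 1
    while value.bit_length() - 1 >= degree:
        shift = value.bit_length() - 1 - degree
        value ^= poly << shift  # cancel the current leading term
    return value


def signature(response: int, poly: int) -> int:
    """Signature left in an idealized single-input signature register."""
    return poly_mod(response, poly)


def aliases(good: int, faulty: int, poly: int) -> bool:
    """True if a faulty response is masked by an identical signature."""
    return good != faulty and signature(good, poly) == signature(faulty, poly)


def aliasing_fraction(length: int, poly: int) -> float:
    """Fraction of nonzero error patterns of `length` bits that alias."""
    masked = sum(1 for err in range(1, 1 << length) if poly_mod(err, poly) == 0)
    return masked / ((1 << length) - 1)


POLY = 0b10011               # x^4 + x + 1, a hypothetical feedback polynomial
GOOD = 0b11010110            # hypothetical fault-free response stream
MASKED = GOOD ^ (POLY << 2)  # error is a multiple of POLY, so it aliases
CAUGHT = GOOD ^ 0b1          # single-bit error, never a multiple of POLY
```

Under the simplifying assumption that all nonzero error patterns are equally likely, `aliasing_fraction` is (2^(n-k) - 1) / (2^n - 1) for a degree-k polynomial and n response bits, which is close to 2^-k for long responses. Other error models change this figure, which is why the choice of parameters needs a systematic procedure.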


VLSI-Based  Multiprocessor  Networks 

We continue to seek network topologies with excellent diameters, especially in the presence
of faults. We have discovered the VARSEA topology. When its properties are fully
characterized, it is expected to have the best known diameter in the presence of faults.
The VARSEA topology is already known to have very facile routing in its fault-free state.
The VARSEA topology is node-symmetric and therefore will provide congestion-free
communication.
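
The quantity of interest here, the diameter in the presence of faults, can be measured by breadth-first search on the surviving network. The VARSEA topology is not defined in this section, so the sketch below uses the undirected binary de Bruijn graph as a stand-in. The function names and the choice to treat faults as removed nodes are assumptions made for illustration.

```python
from collections import deque


def de_bruijn_neighbors(n_bits):
    """Adjacency sets of the undirected binary de Bruijn graph on 2^n nodes."""
    size = 1 << n_bits
    adj = {v: set() for v in range(size)}
    for v in range(size):
        for w in ((2 * v) % size, (2 * v + 1) % size):
            if w != v:  # ignore the self-loops at 00...0 and 11...1
                adj[v].add(w)
                adj[w].add(v)
    return adj


def diameter_with_faults(n_bits, faulty=frozenset()):
    """Diameter of the graph after deleting faulty nodes.

    Returns None if the surviving nodes are disconnected.
    """
    adj = de_bruijn_neighbors(n_bits)
    alive = [v for v in adj if v not in faulty]
    worst = 0
    for source in alive:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in faulty and w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        if len(dist) != len(alive):
            return None  # some surviving node is unreachable
        worst = max(worst, max(dist.values()))
    return worst
```

For example, `diameter_with_faults(3)` gives the fault-free diameter of the 8-node graph, and `diameter_with_faults(3, {1, 4})` returns `None` because removing both neighbors of node 0 isolates it.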


Large Fault-Tolerant Testable RAM’s


Our  analysis  of  the  yield  of  modules  using  the  classical  four-quadrant  architecture  is  largely 
complete. But large modern RAM’s, such as IBM’s new 16-megabit design, do not use the
four-quadrant  architecture.  The  yield  analysis  of  TRAM  modules  should  be  extended  to 
eight-  and  sixteen-quadrant  architectures  in  order  to  analyze  the  most  ambitious  TRAM 
designs. 

In  developing  TRAM  yield  projections  we  have  assumed  that  faulty  modules  are  logically 
isolated  through  global  block  substitution,  but  global  substitution  has  a  larger  hardcore 
than  local  substitution  methods.  Unfortunately,  local  block  substitution  is  less  powerful 
and  makes  the  design  more  susceptible  to  the  spatial  autocorrelation  of  defects  that  the 
center-satellite model reflects. This tradeoff warrants study. There is a wide variety of
local substitution methods, such as interstitial redundancy. We intend to analyze the
merits of various local block substitution methods.
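
The tradeoff between global and local block substitution can be illustrated with a deliberately simplified yield calculation. The sketch below assumes independent Poisson-distributed defects per module, so it ignores the spatial clustering that the center-satellite model captures. All function names and parameter values are hypothetical and are not taken from the TRAM analysis.

```python
from math import comb, exp


def module_yield(defects_per_module: float) -> float:
    """Probability that one module is defect-free (Poisson, no clustering)."""
    return exp(-defects_per_module)


def block_yield(required: int, spares: int, defects_per_module: float) -> float:
    """Probability that at most `spares` of the required+spare modules fail."""
    y = module_yield(defects_per_module)
    total = required + spares
    return sum(
        comb(total, k) * (1 - y) ** k * y ** (total - k)
        for k in range(spares + 1)
    )


def global_yield(required, spares, defects_per_module, hardcore_defects):
    """Any spare may replace any faulty module; hardcore must be defect-free."""
    return exp(-hardcore_defects) * block_yield(required, spares, defects_per_module)


def local_yield(groups, required_per_group, spares_per_group,
                defects_per_module, hardcore_defects):
    """Spares can only replace faulty modules inside their own group."""
    per_group = block_yield(required_per_group, spares_per_group, defects_per_module)
    return exp(-hardcore_defects) * per_group ** groups
```

With equal hardcore, `global_yield(4, 2, d, h)` is never lower than `local_yield(2, 2, 1, d, h)`, since every defect pattern the local scheme repairs is also repairable globally. A sufficiently larger hardcore for the global scheme can reverse the ordering, which is the tradeoff noted above.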


Error  Control  Coding 

Most of our progress to date on arbitration policies has considered modules without redundancy.
Memory modules in modern systems would clearly include redundancy and it
is  also  possible  to  have  modest  levels  of  redundancy  in  arithmetic/logical  modules.  Our 
results  need  to  be  extended  to  apply  to  modules  incorporating  redundancy.  Different  types 
of modules will have different constraints imposed on them. Arithmetic/logical modules
have practical limits on the within-module redundancy feasible, so modular redundant systems
for such modules would depend heavily on high replication of the modules. Memory
modules,  however,  can  very  efficiently  incorporate  redundancy;  further,  the  redundancy 
within  a  module  is  more  valuable  than  the  replication  of  modules,  so  memory  systems 
would  tend  to  rely  heavily  on  within-module  redundancy  with  module  replication  limited 
to 2MR (i.e., a mirroring system). It is possible that our research at this juncture will
branch to allow for an in-depth analysis of 2MR.
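
To make the role of within-module redundancy in a 2MR system concrete, the sketch below shows one simple arbitration policy for a mirrored read. It assumes each module stores a single parity bit with every word, which stands in for whatever error-detecting code a real module would use. The policy, the names, and the status strings are assumptions made for illustration and are not the arbitration policies studied under the grant.

```python
def parity(word: int) -> int:
    """Even-parity bit of a non-negative integer word."""
    return bin(word).count("1") % 2


def store(word: int):
    """Return the (word, check bit) pair a module would hold."""
    return (word, parity(word))


def check(stored) -> bool:
    """True if the stored word is consistent with its check bit."""
    word, bit = stored
    return parity(word) == bit


def read_2mr(copy_a, copy_b):
    """Arbitrate between two mirrored copies; returns (word, status)."""
    ok_a, ok_b = check(copy_a), check(copy_b)
    if ok_a and ok_b:
        if copy_a[0] == copy_b[0]:
            return copy_a[0], "ok"
        return None, "mismatch"   # both look valid but disagree
    if ok_a:
        return copy_a[0], "recovered"
    if ok_b:
        return copy_b[0], "recovered"
    return None, "failed"         # neither copy passes its check
```

Without the per-module check, two disagreeing copies could only be reported as a mismatch. With it, the arbiter can usually tell which copy to trust, which is one way to see why redundancy within a module can be more valuable than further replication.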


4.2  Future  Research  Directions 

Under  AFOSR  88-0205  we  began  to  broaden  our  fault-tolerant  computing  emphasis  to 
explore  reliability  while  keeping  safety  issues  in  mind.  This  trend  will  continue.  Ever  larger 
systems  are  being  built  and  fault-tolerant  computing  techniques  are  being  applied  to  ever 
larger  systems.  As  a  result,  greater  attention  must  be  paid  to  a  wide  variety  of  possible 
failure  modes.  These  failure  modes  may  result  in  different  levels  of  safety  violations  and 
may  also  lead  to  degraded  systems  that  provide  different  levels  of  mission  effectiveness. 

In  addition  to  reliability  and  safety,  security  is  an  issue  for  many  systems.  Decisions  made 
to enhance the reliability/safety tradeoff may have consequences on the security of the
design  (and  vice  versa).  We  plan  to  develop  an  integrated  framework  that  allows  for  the 
evaluation  of  designs  in  terms  of  reliability,  safety,  and  security  criteria.  A  major  part  of 
this  effort  will  be  to  expand  our  models  of  reliability  to  allow  degraded  modes  of  operation. 
In  the  following  we  briefly  describe  two  examples  to  illustrate  the  diversity  of  systems  we 
plan for our integrated framework to accommodate. The second example also
illustrates how the design goals of reliability, safety, and security are interrelated and motivates
the need for an integrated framework to analyze such systems.


DACP  Example 

A  data  access  control  protocol  (DACP)  is  a  set  of  rules  that  specify  who  or  what  is  allowed 
access  to  sensitive  data  and  under  what  circumstances.  Gigantic  databases  can  only  be 
usefully  implemented  via  computer  systems.  Evaluating  the  effectiveness  of  DACP’s  is 
particularly  difficult  for  computer  systems.  We  list  a  few  of  the  pertinent  issues: 


1.  The  integrated  framework  must  allow  the  user  to  specify  the  meaning  of  common 
terms such as the sensitivity of data and the integrity of computer systems and
human operators. As an example of the difficulties involved, consider that a computer
system may provide programs to manipulate data. Such programs may be of
a general nature, such as text processing. The sensitivity of the manipulated data
has four components: (1) part of the sensitivity of the original data, (2) part of the
sensitivity  of  the  manipulating  program,  (3)  the  sensitivity  of  the  knowledge  that 
applying  the  manipulating  program  to  the  data  was  useful  (and  how  the  program 
was  applied),  and  (4)  a  component  that  reflects  security  restrictions  on  allowing  the 
program  to  manipulate  the  data. 

2. The common paradigm for human access to data is: clearance + need to know
(C+NTK) = access. The sheer volume of sensitive data in human-readable form
makes such a simple and vague paradigm necessary. C+NTK is not plausible for
sensitive data in computer systems. The good news is that for a major part of the
access problem (deciding which programs are allowed to manipulate which data), C
is not a relevant part of C+NTK. The bad news is that NTK is too vague to be
implemented by computer DACP’s. A very flexible DACP would allow accesses that
a human would consider invalid. An inflexible DACP would need frequent updating
to allow clearly needed accesses. An entirely new paradigm may be necessary to
meet security objectives while permitting the computer system to fulfill its mission.
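
The C+NTK paradigm in item 2 can be written down as a predicate, which also makes its weakness visible. The sketch below is a hypothetical encoding: clearance and sensitivity are integers, and need to know is reduced to an explicit set of compartment names. This corresponds to the inflexible kind of DACP described above, since the set must be updated whenever a new legitimate need arises. None of the names come from an actual DACP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subject:
    name: str
    clearance: int                        # higher means more trusted
    need_to_know: frozenset = frozenset() # compartments explicitly granted


@dataclass(frozen=True)
class DataItem:
    label: str
    sensitivity: int                      # minimum clearance required
    compartment: str


def may_access(subject: Subject, item: DataItem) -> bool:
    """Human access rule: clearance + need to know = access."""
    cleared = subject.clearance >= item.sensitivity
    needed = item.compartment in subject.need_to_know
    return cleared and needed


def program_may_manipulate(granted: frozenset, item: DataItem) -> bool:
    """Program access rule: clearance plays no part, only need to know."""
    return item.compartment in granted


ANALYST = Subject("analyst", clearance=2, need_to_know=frozenset({"logistics"}))
SCHEDULE = DataItem("convoy schedule", sensitivity=2, compartment="logistics")
PLAN = DataItem("battle plan", sensitivity=3, compartment="operations")
```

Here `may_access(ANALYST, SCHEDULE)` holds and `may_access(ANALYST, PLAN)` does not. The difficulty raised in item 2 is that no fixed set such as `need_to_know` captures what a human would judge to be a valid need.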


C³I Example


Consider a system with a C³I mission. Our concerns for such a system when it operates
in an adverse environment are: (1) (reliability) how effectively it accomplishes its mission,
(2) (safety) whether it causes acts that adversely affect friendly/neutral forces, and (3)
security. Figure 9 shows the communication connections for an example C³I system. The
nodes  labelled  CP  are  higher  level  command  posts. 


1. C³I systems are amenable to graphical representations. Unit capabilities can be
represented  by  node  labellings  and  the  relationships  between  units  can  be  represented 
by  edge  labellings.  The  integrated  framework  should  allow  the  user  to  specify  the 
nature  of  the  relationships  and  to  define  the  criteria  for  the  system  effectiveness, 
safety,  and  security. 

2. The C³I system depicted is somewhat robust. In an adverse environment, even if one
of  the  CP  units  is  incapacitated,  the  system  may  be  able  to  partially  fulfill  its  mission. 
To  provide  workarounds  for  such  contingencies,  however,  may  mean  disseminating 
information  widely  in  the  system.  If  designed  to  operate  only  when  not  impaired, 
then  information  can  be  centralized  at  the  CP  units,  which  can  dynamically  decide 
what  information  other  units  need.  If  designed  to  operate  even  when  impaired, 
additional  information  may  be  needed  a  priori  at  the  subordinate  units;  this  could 
have  adverse  security  consequences.  The  integrated  framework  should  reflect  such 
tradeoffs  and  support  their  analysis. 

3. A C³I system needs to be able to initiate conflict. For instance, it may be desired
that all units determine that the command (initiate, B) has been given, where B is
a possible battle plan. We have done some work along these lines under AFOSR
support. The interactive consistency problem has the objective of ensuring that
all units agree on the commands issued. The consequences of failing to achieve
interactive consistency range from units not carrying out their correct orders to
units mistakenly initiating conflict. Our work on interactive consistency applies to
fully distributed systems. But C³I systems are not fully distributed; they tend to be
at least somewhat hierarchical. Common interactive consistency protocols involve
substantial communication across the system; this may lead to increased risk of
message interception.
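
The graphical representation in item 1 and the robustness question in item 2 can be sketched with a small labeled graph. The topology below is hypothetical and is not the system of Figure 9. Node labels stand for unit capabilities, edge labels for relationships, and the function reports which subordinate units can still be reached from a surviving command post. All names are assumptions made for illustration.

```python
from collections import deque

# Node labelling: unit name -> capability (hypothetical).
NODES = {
    "CP1": "command post",
    "CP2": "command post",
    "U1": "sensor",
    "U2": "weapons",
    "U3": "relay",
}

# Edge labelling: (unit, unit) -> relationship (hypothetical).
EDGES = {
    ("CP1", "CP2"): "peer",
    ("CP1", "U1"): "commands",
    ("CP1", "U2"): "commands",
    ("CP2", "U2"): "commands",
    ("CP2", "U3"): "commands",
}


def neighbors(node, down):
    """Surviving units directly connected to `node`."""
    for a, b in EDGES:
        if a == node and b not in down:
            yield b
        if b == node and a not in down:
            yield a


def commanded_units(down=frozenset()):
    """Non-CP units reachable from at least one surviving command post."""
    sources = [n for n, role in NODES.items()
               if role == "command post" and n not in down]
    seen = set(sources)
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        for nxt in neighbors(node, down):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {n for n in seen if NODES[n] != "command post"}
```

In this example, losing `CP1` leaves `U2` and `U3` under command but cuts off `U1`, a partial fulfillment of the mission. An integrated framework would additionally weigh what information each unit must hold in advance for such a fallback to work.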


The  integrated  framework  should  provide  a  set  of  common  methods  that  permit  evaluation 
of  each  of  reliability,  safety,  and  security  in  depth.  The  integrated  framework  should  also 
enhance  efforts  to  study  when  decisions  to  augment  one  objective  may  impact  others.  The 
ability  to  unify  the  analysis  of  reliability,  safety,  and  security  can  point  out  when  design 
decisions  need  broader  evaluation  (because  collateral  impact  is  negative)  and  when  design 
decisions  bear  additional  merit  (because  collateral  impact  is  positive). 